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Abstract 

A compact scheme is a discretization scheme that is advantageous in obtaining highly 
accurate solutions. However, the resulting systems from compact schemes are tridiag- 
onal systems that are difficult to solve efficiently on parallel computers. Considering 
the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix 
(SPP), is proposed. The SPP algorithm requires less memory than the conventional 
LU decomposition and is highly efficient on parallel machines. It consists of a prefix 
communication pattern and AXPY operations. Both the computation and the commu- 
nication can be truncated without degrading the accuracy when the system is diagonally 
dominant. A formal accuracy study has been conducted to provide a simple truncation 
formula. Experimental results have been measured on a MasPar MP-1 SIMD machine 
and on a Cray 2 vector machine. Experimental results show that the simple paral- 
lel prefix algorithm is a good algorithm for the compact scheme on high-performance 
computers. 


’This research was supported by the National Aeronautics and Space Administration under NASA contract NAS1- 
19480 while the first author was in residence at the Institute for Computer Applications in Science and Engineering 
(ICASE), NASA Langley Research Center, Hampton, VA 23681-0001. 
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1 Introduction 


Recent technological advances have made it possible to build computers that contain thousands of 
processors and can obtain gigaflops (10 9 floating-point operations per second) on real applications. 
Emerging parallel computers are designed to solve large problems and achieve better accuracy 
than could previously be obtained [19]. Parallel computers demand new models, new discretization 
methods, and new algorithms to explore the potential of high-performance computing. 

Conventionally, partial differential equations (PDEs) are discretized by finite-difference or finite- 
element methods, and are solved by Gauss-Seidel, conjugate-gradient, or successive overrelaxation 
(SOR) methods. A new discretization method, the compact finite-difference scheme (or compact- 
difference scheme , compact scheme) was proposed by Kreiss and Oliger [10] and was later im- 
proved upon by Lele [14]. Compared with the traditional finite-difference scheme, the compact 
finite-difference scheme achieves higher accuracy with smaller difference stencils and leads to more 
accurate approximations because of the smaller coefficients on the truncation error. With these ad- 
vantages, the compact scheme has quickly gained popularity. In practice, the resulting discretized 
system of the compact schemes are tridiagonal systems that can be solved efficiently on sequential 
machines. However, tridiagonal systems are difficult to solve efficiently on parallel computers. For 
example, to study the physics of compressible homogeneous turbulence, the CDNS (compressible 
direct simulation of Navier-Stokes) code, based on a sixth-order compact scheme and a third-order 
time discretization, was implemented on the Intel Delta [5]. After carefully choosing an existing 
tridiagonal solver, mapping the algorithm to the architecture, and overlapping communication with 
computation, the communication overhead consumed about 30 percent of the total execution time. 
Clearly, more efficient algorithms are needed to explore the potential of compact schemes on parallel 
computers. 

In recent years, intensive research has been done to develop efficient tridiagonal solvers on 
parallel computers. A good survey can be found in references [15], [7], and [12]. Of the known 
tridiagonal solvers, both the recursive-doubling reduction method (RCD) developed by Stone [17], 
and the odd-even, or cyclic, reduction method (OER) developed by Hockney [8] are able to solve an 
7 i-dimensional tridiagonal system in 0(log(n)) time using n processors. These methods are designed 
for fine-grain computing. Substructured methods developed by Lawrie and Sameh [13], Wang [24], 
and Sun, Zhang, and Ni [23] were designed for median-grain and coarse-grain computing (i.e., the 
case of p < n or p « n, where p is the number of processors available). Lawrie and Sameh’s 
algorithm is designed for shared-memory machines; Wang’s algorithm is designed for distributed- 
memory machines; and Sun et al. proposed three different algorithms, each of which may be a better 
choice depending on the problem and the machine. For compact schemes, the tridiagonal systems 
have a special structure that consists of diagonal dominance and are almost symmetric Toeplitz. 
For this special structure, a parallel tridiagonal solver for fine-grain computing, the simple parallel 
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prefix (SPP) algorithm, is proposed in this paper. 

Compared with the popular tridiagonal solver for fine-grain computing known as the cyclic 
reduction method [8], the SPP method is simple to implement and computationally efficient. It 
requires only 2-log(n) AXPY (vector plus scalar times vector) operations and prefix communication 
patterns. If the tridiagonal system is diagonally dominant, then the AXPY operations can be 
truncated after a certain number of steps without degrading the accuracy. A formal accuracy 
analysis is conducted and simple formulas are provided to compute the number of AXPY operations 
necessary. 

This paper is organized as follows. Section 2 will present the compact scheme and discretized 
tridiagonal systems. Section 3 will introduce three versions of the simple parallel prefix algorithm: 
the SPP for tridiagonal systems with the given special structure, the SPP for solving symmetric 
Toeplitz tridiagonal systems, and the SPP for solving almost symmetric Toeplitz systems. Accuracy 
analysis will be conducted in Section 4. Section 5 will give experimental results on a 16K processing 
elements (PEs) MasPar SIMD computer and on a Cray 2 supercomputer. Finally, Section 6 will 
give the conclusions. 

2 Compact Finite-Difference Schemes 

With the conventional finite-difference and finite-element discretization methods, as the order of 
the approximation increases, the required number of boundary and near-boundary relations and 
the required number of mesh points per derivative stencil increases accordingly. To achieve higher 
accuracy without the use of additional mesh points, the compact scheme [10] was introduced. As 
originally suggested by Kreiss and Oliger [10], and later discussed for fluid dynamics problems by 
Hirsh [6], the first and second derivatives for compact differences may be approximated by 


fn = 


D n 


1 + \hlD + D. 


fn and f n = 


D + D_ 


1 + ±hlD+D. t 


fn 


( 1 ) 


where 


Dofn 2 ^ /n— l)j 


D+fn — h (/n+1 /n)> 


n./ B = 

and h x is the mesh spacing, which is constant for simplicity. By multiplying (1) by the respective 
denominators, relations for the derivatives may be found, which yield 


g/n-l + g/n + g/n+ 1 — ^ ~ fn- 1), 


( 2 ) 
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and 


(3) 


y^/n-l + g/n + y^/n+1 _ h 2 ^ U+1 2 ^ n + 

These equations yield tridiagonal systems when the appropriate boundary conditions are applied. 

To make an accurate comparison between the compact-difference (eqs. (2) and (3)) and the 
standard central-difference schemes, Taylor-series expansions are employed. As Hirsh has shown, 
the truncation error for the compact differences are 

E(A) = W") = 


Similar error analyses for the central differences yield 

e(/o = -^/>;/ w Md eta = 


Although both schemes are fourth-order accurate, the compact-difference scheme should lead to 
more accurate approximations as a result of the smaller coefficients on the truncation error. Similar 
results hold for other higher order approximations. 

As yet, no mention has been made about the boundary treatment for the compact scheme. 
At the boundaries, Hirsh [6] experimented with a variety of boundary conditions, and Adams [1] 
suggested a boundary relation that includes near-boundary derivatives in the formulation. The 
boundary conditions used by both Hirsh and Adams retained the tridiagonal nature of the sys- 
tem. To demonstrate the SPP algorithm, fourth-order one-sided finite differences will be used for 
boundary conditions. 

Many relevant fluid dynamics applications can make use of high-order compact-difference oper- 
ators to numerically solve the governing systems of equations. For example, Burger’s equation, the 
boundary-layer equations, and the driven cavity problem were solved by Hirsh [6] with compact- 
difference operators. Further, Joslin et al. [9] used the compact-difference equations (2) and (3) to 
numerically solve the fully nonlinear Navier-Stokes equations of fluid dynamics. 




(4) 


duj t duj _ _}_dp_ , d 2 u t 
dt U * dxj p dxi V dxjdxj' 

where Ui = {u\^U 2 y Uz) are the velocity components that correspond to the steamwise, normal, and 
spanwise directions, — (^ 1 ^ 2 ? £ 3 ); v is the kinematic viscosity, p is the density, and repeated 
indices infer a summation over the index. To demonstrate how compact-difference operators could 
be employed to solve the nonlinear PDE’s (4) and (5), consider the model problem of the one- 
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dimensional heat-conduction equation: 


du _ d 2 u 
dt K dx 2 


( 6 ) 


To solve this equation computationally, discretizations in time and space must be chosen. From 
the Taylor-series expansions in time, one derives the discrete equation 


u 


n+l 


u n + kA t 


d 2 u 

dx 2 ’ 


(7) 


where n is the number of time levels. If the result at level u n is known, then the solution u’ i+1 
can be obtained if d 2 u/dx 2 can be determined. The spatial derivative can be computed with the 
second-derivative operator (3). Each time-step advancement requires a single compact-difference 
solve. In the original nonlinear PDE systems (4) and (5), both the first and second derivative 
operators are required; the problem is multidimensional and requires a compact-difference solver 
with many right sides. Time-marching is necessary and requires compact-difference solves at each 
time level. This necessitates a fast compact-difference solver. By observation, equations (2) and 
(3) take the matrix form 

’ 1 0 
1 c 1 


x = d 


1 c 1 
0 1 


(8) 


where x = {/',/"} and c = {4, 10} correspond to the compact-difference parameters. The first and 
last rows of equation (8) arise from boundary conditions. The boundary condition that corresponds 
to d ^ + 1 can be rolled into the system by a:;v+i = d^+i and d^ = djv - djv+i- Thus, the system is 
reduced to the N x N system: 


1 0 

1 c 


1 

1 


x = d 


c 1 

1 c 


(9) 


With higher orders of approximation, the resulting matrix will differ only in the boundary 
conditions. However, with the appropriate restructuring given above, the resulting tridiagonal 
systems can be written in the almost symmetric Toeplitz form described in the next section, eqn. 
(10), where A is given by (12). 
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3 The Simple Parallel Prefix Algorithm 

We are interested in solving a tridiagonal linear system of equations 

Ax = d. (10) 


In this system of equations, A is either a symmetric Toeplitz tridiagonal system of order n 


c 1 
1 c 


= [l,c,l], 


1 


L 1 « J 

or an almost symmetric Toeplitz tridiagonal system of order n 


( 11 ) 


’ a 0 
1 c 1 



( 12 ) 


where x = (ij, • • • , x n ) T and d = (di • • • , d n ) T are n-dimensional vectors. We assume that matrix 
A is diagonally dominant (i.e., |c| > 2). Although we assume that A, x, and d have real coefficients, 
the extension to the complex case is straightforward. 


3.1 The Simple Prefix Method 

In this section, we study how to efficiently solve the system 

Ax = d (13) 


on parallel computers, where 

/ a 1 \ / 1 

1 c 

= a 

1 

V 1 c) \ 

and a and b are the real solutions of: 


A = 


b 1 


b 1 


/ 1 b 




b 

1 


= a -[6, 1,0]- [0,1,6], 


a + 6 = c 
a • b = 1 


(14) 


Because a • b = 1 and |c| > 2, we may further assume that |a| > 1 and \b\ < 1. By equation (13), 


x = a" 1 • [0, l,0] -1 d = b • [0, 1 ,b]~ l [b, 1,0]-^. 
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Let L — [-6,0,0]. Then 

[6, 1,0] = [0,1,0] -[-6, 0,0] = /- L 

and 

[6, 1 , 0]~ 1 = (/ + L + L 2 + ■••+ Z/* _1 ) (15) 

= (/ + Z, 2P °* n1-1 )(/+ /,2 n °K"l- J ) (I + L 4 )(I + L' 2 )(I + L). (16) 

Note that n is the dimension of matrix A and that equations (15) and (16) can be verified directly. 
The superscript of matrix L represents matrix multiplication 

/ 0 
0 0 

V = (-*)•' o o 

0 

o o (-by o o 

where the first nonzero element (-6)* is at position (i + 1, 1). Similarly, let U = [0,0, -6]. Then, 

[0,1,6] = [0,1,0]- [0, 0,-6] = /- U, 

and 

[0,1,6]-* = (I+U + U 2 + --- + U n - 1 ) 

= (/+f/ 2n ° snl_1 ) (I+U 2 )(I+ U), 

where 

o o (-by o o 

0 

o o (-by 

0 0 
0 

The first nonzero element of U l is at position (1,*+ 1)- Thus, the solution of equation (13) is 

X = 6 • (/ + t/ 2flogn1_l ) ■ ..(/+{/).(/+ L2no*ni-> J .(r+L)d (17) 

Equation (17) was first used by Chung and Shen [2] in solving 6-spline curve fitting on a Cray 
X/MP computer. 
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Let v = (v\,V 2 , • • •, v n ) T be an n-dimensional vector. Given the special structure of L % , we find 

(/ + !> = u + (-6)‘v w , 

where 

%) = (0, • • -0, Vi, • • ’V n -i) T 

and vj is the i + 1 element of vyy Similarly, given the special structure of U we find 

(/ + U { )v = v + 

where 

V (0 = (t*i+l,-“,«n,0---0) T . 

Equation (17) shows that equation (13) can be solved with 2 • flogn] AXPY operations. Because 
|6| < 1, ||L'|| — ♦ 0 and ||C/'|| — * 0 when n — *• oo, the AXPY operation may be truncated without 
influencing the accuracy. Formulas will be given in Section 4 to compute the smallest truncation 
integer k. A sequential Fortran-like code for solving equation (17) within truncation error is given 
in Figure 1. 


Do 20 i = 1, k 
Do 20 j = n, 2* + 1, -1 
dj = dj + (—bydj_ 2 i 

Do 40 t = 1, k 
Do 40 j = 1 , n — 2* 
dj — dj -f- ( ty* dj+2* 

Do 60 j = 1, n 
ij = b • dj 

Figure 1. The simple prefix method. 


If n processing elements are available and dj is stored in processor j , then the loop 20 and 
loop 40 will lead to prefix computations. Figure 2 shows the prefix computation pattern that 
correspond to loop 40 when n is equal to 8. Loop 20 leads to a similar prefix computation pattern, 
except that the communication is from right to left. Prefix (or recursive doubling) computation 
is a widely used computation model in scientific computing. Any linear recursive relation can be 
computed by recursive doubling [21]. A recursive- doubling algorithm exists for solving tridiagonal 
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systems and involves matrix-matrix multiplications [4, 23]. Compared with the existing recursive- 
doubling algorithm, the computation of our prefix method for solving tridiagonal systems (13) is 
simple. We call the prefix method given in Figure 1 the simple prefix method. Figure 3 shows the 
communication pattern of the widely used cyclic reduction method [8]. In a comparison of figures 
2 and 3, we can see that the prefix method has a relatively simple communication pattern. 



Figure 2. Communication of prefix computation. i 



Figure 3. Communication of cyclic reduction method. 

j ' ' — ’ ; ' ! ' I 


3.2 Modification of Symmetric Toeplitz System 


Our goal is to find the solution of equation (10). Modification is needed to convert the solution of 
equation (13) to the desired solution. The modification will be different for symmetric Toeplitz sys- 
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terns and for almost symmetric Toeplitz systems. For a given symmetric Toeplitz system, equation 
(13) is modified as 


A = A + LA = A + VE t , 


where 

0 \ 
0 0 


0 0 0 

0 0 / 


/ b 
0 


LA = 


(18) 


and V = (6, 0, • • •, 0) T and E = (1,0, • • •, 0) T are n-dimensional vectors. By the matrix-modification 
formula [16, 3, 23], equation (10) can be solved by 


x = A-' l d = (A + VE T )~' l d 
x = A~ x - A~ l V(I + E T A- l V)- 1 E T A~ 1 d 
= x- A~ X V{I + E t A~ x V)~ x x u 


where x\ is the first element of vector x. If the calculation approach given in [18] is followed, we 
have 

?T A- lm-1 


1 


and 


Thus 


i,+ EA ' n -TTCT 


1 = 1 1=1 t=l 


A~ X V{I + E t A~ X V)~ X 


-t*( LLH , ... ( ... < ty - c-* 2 ) ) T „ 9) 

^l_fr 2 (n+l)’V ^ _ fr 2 («+l) » ^ ^ ! — J2(n+l) ’ *) { _ f, 2 (n+l) J ' i 19 ^ 


The final solution is 


X = X - X\Z, 


( 20 ) 


where vector z is the right side of equation (19). Because |6| < 1, z can be truncated at some 
integer k\ without afFecting the accuracy (see section 4). Furthermore, when n is large, 6 2 ( n-,+1 ), 
i = 0, 1, • ■ •, ki will be less than machine accuracy, and z reduces to 


l = ((-h) 2 ,(-6) 3 ,-..,(-6) fc ‘ +2 ,0,--.,0) T . 
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The program of modification is given in Figure 4, and the algorithm for solving the symmetric 
Toeplitz tridiagonal system is given in Figure 5. 



3*3 Modification of Almost Symmetric Toeplitz System 

A similar modification can be given to solve almost symmetric Toeplitz systems. For almost sym- 
metric Toeplitz systems, equation (13) is modified such that 


A = A + AA = A + VE T i 


where 


/ 0 -1 0 \ 

0 0 0 


A A = 


V 


0 0 0 
0 0 
0 / 


( 21 ) 
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V - ( - l»Oi • • •»0) T , and E = (0, 1, 0, • • -0) T . Following the matrix modification formula [23, 18], 
the solution to equation (10) becomes 


x = A~'d = A~ l d- A~ l V(I + E T A- l V)-'E T A-'d 
= x - A~'V{I + E T A~' V)~'x 2 , 

where x 2 is the second element of x. With this new modification, 


(/ + iPA-'v )- 1 


a 

“-DSfMF'’ 


a-' v = -HE ‘ 2i . • ■ ■ , 

t=0 i=0 t=0 

i-'v-(/ + £^-'v)-> = (->)< 1 , • • • , (-yfrlr) 7 - 


The final solution is 


z = 5 - x 2 y , 


( 22 ) 

(23) 


where the vector y is the right side of equation (22). Like the symmetric case, only the addition of 
the first &2 elements in equation (23) may be needed for a given accuracy for some integer and 
lipl-O 

may be less than machine accuracy for 0 < i < ki when n is large. Let 


y = (-6,(-6) 2 ,-.-,(-6) fc S0,---,0) T . (24) 

The modification computation for the almost symmetric Toeplitz system is given in Figure 6. The 
algorithm for solving the almost symmetric Toeplitz systems is similar to the algorithm for solving 
the symmetric Toeplitz system (see Figure 5), except in step 2 the new modification (Figure 6) is 
used to replace the symmetric Toeplitz system modification. 


Do 100 i — 1, &2 

X = Xi- x 2 yi 

Figure 6. Modification for almost symmetric Toeplitz system. 
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4 Accuracy Analysis 


In a previous section, the claim was made that the AXPY operations and modifications can be 
truncated without influencing the accuracy. In this section, we will study the truncation accuracy. 

For simplicity, we truncate loop 20 and loop 40 at the same step k. The norm used in this section 
is the Z/i-norm. 

4.1 Accuracy Study of the Simple Prefix Method . 

Let x,y be exact solutions as | 

; y = b • [6, l,0] -1 d, I 

= (7+L + L 2 + 

1 

x = b ■ [0, 1 , 6] — 1 [6, 1,0] -1 • d 

' = (I + U + U 2 + ■ + U n ~ l ) • y. 

Let y* and x 1 be the solutions given by loop 20 and loop 40, respectively, as 

j f ={I + L + ---+ L k ~ l )-b-d, | 

where k is a power of 2, and \ 

x‘ = (I + U + --- + U k -')y*. I 

Also, we define x* as a hypothetical solution by f 

i 

\ . x* = (/ + {/ + ••■ + U k ~')y. ] 

i _ i 

i - - i 

The difference between x and x* is 

| 

| x-x* =(/ + (/+ 1- U n ~ 1 )y - (I + U + 1- U k ~ 1 )y 

! = {U k + U k+i +•■•+ U n ~ l )y 

l ={U k + ---+U n ~ l )(I-U)x 

1 =(U k -U n )x. 

i 

I _ . . — : 

I 

Thus, we find 

11 < \\U k \\ ■ \\I - t/ H - fc || < |6| fc (l + 

E 11*11 
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Equation (25) gives the relative error between x and x*. Because the difference between x* and x' 
is 

x* - x' = (/+ U + • • • + U k ~ l )y -{I + U+--- + U k ~ 1 )y* 

= (I+U + --- + U k -'){y-y*) 

= (/ + U + • • • + U k - 1 ){L k + • • • + I n_1 )*b'd 
= (/+[/ + ■■■ + U k ~ 1 ){L k + • • • + L n ~ x ){I - L)(I - U)x 
= (/+•••+ U k ~ l )(L k - L n ){I - U)x, 

the norm becomes 

V S • W* • (1 + W*-*)(l + IH). (26) 

If equations (25) and (26) are combined, then the relative error between the exact solution x and 
the truncated solution x' is 



< W*(i + + 1*1)) 

< 2-|6|‘-(l + |6|"-‘). 

4.2 Final Accuracy of Symmetric Toeplitz Systems 

The truncation error of the simple prefix method (Figure 1) will be carried into the modification 
step and will influence the accuracy of the final solution. Let x be the solution of a symmetric 
Toeplitz tridiagonal system (10) and x* be the corresponding solution that has been modified with 
the truncated solution, 


X = X - X\Z, 

x — x — x\z, 


where z = 6 2 , (-6) (-b) n ~ l 

ated by this truncation is 


(V 2 n 

l_fc2(n+l) ) 


T 

; see equation (19). The error gener- 


ic -**ll = ||(x-x') + (x 1 -x\)z\\ 

< ||x-x , || + ||x 1 -x' 1 ||-||^|| 

< ||x-x'||(l + ||z||). 

The norm of vector z can be computed directly as 
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Therefore, 


and 


ll*ll = T -4 w El« , ('-‘ 2{ "- i, )i 


n-1 


1 _ fc2(n+l) j 

b 2 


i=0 
n— 1 


n— 1 


- 1 _ &2(n+l) 

< 


5 «t(E 1*1' + E i‘i 2n ") 


< b‘- 


1=0 i =0 

b 2 (i + i*r +1 xi-i*r) 

(i-w) 

b 2 


- 1 _ 62(n+l) 

l 2 ( 1 - H n ) 


< 


1*1 


(i-|*l n+, )(i-l*l) "1-1*1 W-1 


1*1 


< ||x — X / Il(l + | a | j y) 

_ i'\\ ( l c l ~ 1 ) 
IK |a|- r 

~nu 1 — 1*1 + 1*1 


= x 


= x-x 


^)sll*-*'ll( I q ir >. 


IS - i'll ^ [| i - x ' H , l-W + l * l \ 

^ *« " V 1 1 LI / 


X 


x|| V 1 - 1*1 
X - x'H ||xl|,l-l*| + |*| 2 


(- 


hi iwr 1-1*1 


)• 


(27) 

(28) 
(29) 


(30) 

(31) 


The only unknown in the right side of equation (31) is ||^||, where x is the solution of equation (13) 
and x is the solution of equation (10). For a symmetric Toeplitz system, 

A = A + Aj4, 

where A>1 is given by equation (18). If equations (10) and (13) are combined, we have 

( A + A,4)x = Ax, 


and 


(I + A 1 AA)x = x, 


x 


(32) 


After several calculations, we find 
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A = b 


1 -b b 2 • (-6) n_1 \ 


1 \ 

1 -b ■ (~b) n ~ 2 


-b 1 

. 


b 2 -b 1 

1 -b 


. 

1 J 


*>0 

I 

7 

2 


E”.i * 2 ’ 

0 ■ 

■ 0 > 

EEi ’f-*)* 2 ’ 

0 • 

■ 0 


0 ■ 

■ 0 

(-6)"+' 

0 • 

' 0/ 


Therefore, 

< a ■ ti+|, rir | tn 


and 




(i + ifrr+'Ki - 16[") 

(i - i*d 


The relative error of the solution to equation (10) is 


Ui^£H < 11* -*'11 M! _j_ m > 

11*11 “ 11*11 'Ml'i-W 

. i»i‘(i + i*r-*) (, , (i - i»i*)(i + 1*1)) (. , * 2 (i + ir +1 )(i - i*r)) nn 

- n*i l + iH*j A (i-» 2 )(i-i6i) j (34) 


When the order of the tridiagonal system n is large, ]6| n may be less than machine accuracy. In 
this case, the inequality (34) becomes 


II*- *11 < mi + H n - fc ) (. , (i-16| fc )(i + H) 
INI - i-\b\ \ + 1-|6| 



h l ) 

(i - b 2 )(i - |fr |) ) ' 


(35) 


when truncation is appbed in the modification. In addition to the truncation error carried from 
step 1, the truncated modification will also generate truncation error. If 


( ~f ~f z 

X — X X ^ z , 
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(Figures 4 and (20)), then 


- x' = x - i! - {x\ - x\)z + x\(z - z). 


The Li-norm is given by 


which leads to 


In equation (36), 


I* - x'|| < ||*-*1| + ||*-*1|-||*|| + ||*||-||*-*| 

< ||* -s'lKi + ||*||) + ||*|| -II*- *||, 


I* - *11 ^ Ik *11 ll*ll/i | n=in + ih - ; ii 


*-* = 


1 _ fc2(n+l) ■ 


ERl-i 2M )l 

nl-1 


l _ £2(n+l) 


■£ 16^(1 -6) 2( " , -»)|, 


where j = i — fci, nl = n — k\. Then 


, ... . i ^ i+2 (i + ir 1 + 1 )(i-M 

'* *»« - 1 _ &2(n+l) ‘ (1 _ |6|) 

|6| fcl+2 (1 + |6| nl+1 )(l - |6| nl ) 

- l - (1 - |6|) 

|6|*fha 

- 1- 161- 


Note that ||5|| < ||z||, so that we have the inequality 


I* - i'll Ik 


& ,+ j w i ,+ .S 


, 1*1* (, , (1-W t )(i + WA f, . * 

s thsiv t-l*i ) + 0 - m - 


\ |6|*‘+ 2 
|6|)j + 1-6 


The error introduced by the truncated modification is insignificant if k\ > k — 2. In practice, we 
can choose k x = k and use the inequality (34) to compute the truncation number k. 

4.3 Accuracy of Solving Almost Symmetric Toeplitz Systems 

Following the similar arguments given in last section, error bounds can be obtained for almost 
symmetric Toeplitz systems. For almost symmetric systems, we have a different modification 
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vector y and a different modification matrix A A. From equation (23), the exact solution to the 
almost symmetric Toeplitz system is 

x - x - x 2 y , 


where 


l-b 2 

y = (-*,(-*) - 


The norm of vector y is 

Ml = £ 


1=1 


(-*)• 


. 1 _ 0n— »'+l) 


1 - b 2n 


< nW£ |6|i+ £ 6!,n+,) ' ir ‘) 


1 = 1 1=1 


< 


i / wu-wxi + w+'A < _w_ 
i-^V (i - l*o y-i-w 


Let x* be the solution modified with the truncated solution as 


x 


* 


= x r 


*22/, 


where x\x 2 are the same as defined in Section 4.1. Then, following a similar deduction for the 
inequality (29), we have 


II*- **|| < II* -*'ll(i + Ml), 

II*- *11 < ll*-*'ll ll*ll n , W X 
11*11 - 11*11 'lie i - i*r 

||x-x , || ||*|L 1 X 

- ||*|| ll*ll 1 - b ’ 

The above inequality is in the same form as inequality (31), except that the ratio |j||| is different 
for the almost symmetric system. For the almost symmetric Toeplitz system, 


x = A + A,4, 


where A^4 is defined by equation (21). By 


A~ l AA = 


/0 

0 


V o 


direct calculation, we find 

EICo 1 -6 • b 2 ' o • 0 \ 

r:= 0 \-b) 2 b 2i o • o 

(-6) n 0 • 0 ) 
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Therefore, 


n— 1 


p-^ii < 53 


*=o 


(-6) ,+1 (l - 6 2 ( ,l -‘)) 


(l-* 2 ) 


< 


itKi+ur'xi-ir) 

(1 _ 6 2)(! _ |(,|) ■ 

For the almost symmetric system, 

M<||/ + i- 1 A J 4||< 1 + 114-^11. 

(See inequality (32).) We conclude that 

||z-z*|| < |6|*(1 + |6|"-*) ( x | (1 - I6| fc )(l + |6|) ^ ^ | 161(1 + 161^X1-161”) ^ (37) 


1-1*1 


1 - 1*1 


(1 _*2)(1 _ 1*1) 


< J^f 1+ (1-I*l fc )(l+I*[) )(1+ 


1*1 


1 - 1*1 


1 - 1*1 


( 1-* 2 )(1 




(38) 


When compared with inequality (34), the numerator of the last term in inequality (38) is slightly 
increased. 

If the modification is truncated to (Figure 6), then 


x f = x 1 — x\y. 


Following the derivation of inequality (36), we have 

'ia-s'll . I|* -*'11 ||£|| 

INI 'INI' 


where 


< 


(i + ll»ll) + lly - »ll» 


n,- m = 


i=k2 


1 - b 2n 


< JTjS £ l‘’(l - ‘ 2( "- i, )l 
1 = ^2 

|6| fca+I (1 + |&|«-*a-i)(i - |ft|n-*a) 


n— 1 


< 


< 


1 - b 2n 
| 6| fca+1 
1 - 1 * 1 ' 


( 1 - 1 * 1 ) 


(39) 
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Thus, 


lis-^ll < Jt (i , (i-W*)(i + HA (. , H ^ . 1^1 
11*11 “ l-l&l V 1 - 1*1 A + (1-62)(1-|6|)J + 1-|6|' 


(40) 


Similar to the symmetric case, the error introduced by modification truncation is insignificant 
if &2 > k - 1. We could choose k 2 — k and use inequality (38) for error estimates. 


5 Experimental Results 

In this section, the performance of the SPP algorithm on the MasPar parallel computer and on the 
Cray vector computer will be presented. 

5.1 Parallel Computing 

The MasPar MP-1 is a distributed-memory massively parallel SIMD computer with a high-speed 
two-dimensional toroidal mesh topology. A control unit (ACU) has a direct connection to all 
the processing elements (PEs) and issues instructions at a 12.5 MHz clock rate. Each processing 
element in the array is a 4-bit custom load and storage processor with a minimum of 16 kilobytes 
of memory. Communication is relatively cheap on the MasPar. For example, on the MasPar M-l, 
a double-precision multiplication function is ten times more expensive than sending the product to 
an adjacent PE. 

Table 1 gives the computation and communication count of the simple parallel prefix (SPP) 
algorithm. We assume that both the AXPY operations and the modification are truncated after k 
operations; the modification vector used is either equation (20) or (24), respectively. The b 2 \i = 
1 5 are computed sequentially, or redundantly, on each PE. The modification vector z or y 

will be computed concurrently on different PEs. We count a power-function computation as 8 
floating-point operations. So, the modification phase costs 10 parallel operations in total. The 
calculation of b (equation (14)) is not considered. In the computing phase, 2 *log(A;) parallel, one- 
to-one communications are required. We use a to represent the one-to-one communication. On 
the MasPar, the one-to-one communication is achieved by using the router command. We use ft 
to represent the broadcast. On the MasPar, the broadcast is achieved by transferring the local 
data to ACU and then distributing it to all PEs. One broadcast communication is needed in the 
modification phase. Because the tridiagonal systems that arise in the compact scheme have multiple 
right sides, the computation and communication count for solving multiple right-side systems is 
also listed in Table 1, where the computation of b 2 and the modification vector are not considered. 
Note that rt\ is the number of right sides. 

Two sample matrices are chosen to illustrate the performance of the SPP algorithm and to 
verify the theoretical error bounds given in the previous section. Both of the matrices are almost 
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Table 1. Computation and communication count of the Simple Prefix Algorithm 


System 

Best 

sequential 

SPP 

Computation 

Communication 

Single system 

8n-7 

5-log(fc) 4- 10 

2-log(fc) • a + (5 

Multiple right sides 

(5 n — 3) * nl 

(4-log(fc) + 2) * nl 

(2-log(fc) • a + /?) * nl 


symmetric Toeplitz matrices that arise in the compact schemes. One matrix is 

Ai = [1,3,1]- Ai. 


The other matrix is 

A 2 = [1,10,1]- Ai. 

Equation (18) defines Ai. The corresponding solution of equation (14) is 6 = and b = 7 - ~)p^ 
for A\ and A 2 , respectively. The error is measured relative to the LU solution. The accuracy 
comparison for solving system A\ is given in Figure 7. In this implementation, no truncation is 
implemented in the modification phase, and the prediction formula used is equation (38). In solving 
A 2j the modification is applied at the modification stage with k 2 = k, and the prediction formula 
used is equation (39). The accuracy comparison for solving system A 2 is given in Figure 8. From 
Figures 7 and 8, we can see that the theoretical bound matches the measured results well. 



Figure 7. Measured and predicted accuracy for solving matrix A\ . 


Figure 9 shows the speedup of the SPP algorithm over the best sequential algorithm for solving 
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Figure 8. Measured and predicted accuracy for solving matrix A 2 . 

a single system. The best sequential algorithm is based on the LU decomposition, and is fine- 
tuned to take advantage of the almost symmetric Toeplitz structure. All computations are double 
precision. Truncation numbers k = 64 and k = 16 are chosen to achieve double-precision (10 -16 ) 
and single-precision (10~ 7 ) accuracy, respectively. The order of matrix is the same as the number 
of PEs available. Because of the high-speed communication, with n equal to Ik, 2k, 4k, 8 k, and 
1 6A-, the execution time is not noticeably changed in parallel processing. The sequential algorithm 
is implemented on a single PE. Because of memory limitations, only small systems are solved by the 
sequential algorithm. The data used in Figure 9 is predicted based on the small-size timing. Figure 
10 shows the corresponding speedup of solving a system with 1024 right sides. The factorization of 
the matrix is not included in timing for solving the system with multiple right sides. The speedup 
is slightly higher for solving multiple right-side systems. 

Because the order of the matrix increases linearly with the number of PEs available, the speedups 
given by Figures 9 and 10 are memory-bounded speedup [20]. From table 1, the problem size, in 
terms of floating-point operations, is a linear function of the order of the matrix. The linear memory- 
bounded speedups given by Figures 9 and 10 indicate a linear speed increase. In accordance with 
the isospeed metric of scalability [22], the SPP algorithm is perfectly scalable on the SIMD MasPar 
machine. Because the isoefficiency metric [11] is equivalent to the isospeed metric when serial 
execution has a fixed speed [22], the SPP algorithm is also perfectly scalable with the isoefficiency 
metric. The reader may refer to [22] and [11] for more information regarding scalability of parallel 
algorithm-machine combinations. 
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Figure 9. Speedup over the best sequential algorithm on single system. 



Figure 10. Speedup over the best sequential algorithm on system with 1024 right sides. 


22 





5.2 Vector Computing 

Vector computing is widely used at national laboratories, universities, and supercomputing cen- 
ters for large-scale computing applications. For this reason, a CRAY-2S/4-128 at NASA Langley 
Research Center was also used to test the SPP algorithm on a vector machine. The Cray -2 no- 
tation “S” indicates that the memory is static rather than dynamic, and “4-128” indicates that 
the machine has 4 processors and 128 million 64-bit words of central memory. Each CPU is a 
register-to-register vector processor with a 4.1 nsec minor cycle clock that can generate 100-300 
megaflops. The four processors can be used for a single problem (multi-tasking) to achieve over 1 
gigaflop of performance. 

The speed of a vector machine depends on the vector length, vector stride, and the computa- 
tional richness of the do loops. Because the vector register length is 64 and the CPU is extremely 
fast in carrying out floating-point operations, once operands are in the registers, best performance 
can be obtained with do-loop which have lengths that are multiples of 64, which are computationally 
intensive, and which use unit stride (separation of memory between elements). 

The chosen sample tridiagonal matrices are almost symmetric Toeplitz and correspond to the 
first and second derivative compact-difference operators (2) and (3). The diagonals and necessary 
coefficients are: 

A 3 = [ 1,4,1] with a, b = 2 ± \/3 

A 2 = [1, 10, 1] with a, b = 5 ± 2y/6 

For the first experiment on the Cray, single tridiagonal solves were made. The test problem 
used here and in the rest of this section corresponds to f(x) = 3x 3 — 2x + 1 , which has smooth 
exact derivatives f f (x) = 9x 2 — 2 and f n {x ) = 18x. Figure 11 shows the performance of the 
SPP in terms of CPU seconds and the matrix order compared with the LU decomposition for 
computing /' and /". In addition to the reduced memory requirements of SPP compared to LU, 
the performance shown in Figure 11 clearly indicated that the SPP is faster on the vector machine 
than the conventional LU solver; the benefits increase with the operator size. The significant 
difference between the SPP and LU timing can be explained in light of vector operations versus 
scalar operations. The SPP approach can be vectorized over the direction of the solve; the LU 
approach must use scalar operations. For the SPP approach, note that the diagonal dominance of 
the second-derivative operator /" leads to faster computations compared with the first-derivative 
operator f f . This time reduction results from the truncation of the SPP approach to obtain a 
predetermined level of error (10“ 14 ), which is essentially machine precision. For the first-derivative 
operator (A 3 ), k = 32 and k\ — 24; for the second-derivative operator (A 2 ), k = 16 and k\ = 16, 
where k and k\ are the truncation numbers on the solving and modification phases, respectively. 

Real applications which use compact-difference operators require many tridiagonal solves that 
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Figure 11. Timing of SPP and LU algorithms: single system. 

correspond to time-marching algorithms and involve many right sides corresponding to the mul- 
tidimensionality of the application. In this second evaluation, with the same accuracy 10" 14 , the 
performance of the SPP is compared with LU for multiple right sides. Shown in Figure 12 are 
CPU times for the SPP and LU for various orders of the second-derivative compact operator A 2 . 
(Similar results were obtained with the operator A 3 but are not shown.) For applications that 
use small operators (N < 96), the LU solver is more efficient than SPP; for applications that use 
large operators (N > 96), the SPP is much cheaper than the LU approach. This difference occurs 
because the LU approach vectorizes the do loop associated with the number of right sides, and 
the SPP vectorizes in the direction of the tridiagonal solve. With some creative programming, one 
could potentially vectorize the entire SPP approach with a single array, while the LU approach can 
vectorize over the right-side arrays. 

In the final experiment, the ability of SPP to control truncation error is demonstrated. The 
highest order of accuracy in the solution is based on the truncation error of the compact-difference 
approaches in equations (2) and (3). As a result, to require machine-zero is overkill for the compact 
solver and leads to unnecessary computational cost. By using the inequality (40), the choice of 
truncation can be determined based on a desired error bound. Figure 13 shows the SPP results 
of truncations k = 8 and k = 32, which correspond to errors 10 -5 and 10 -14 , respectively. If the 
accuracy of the SPP is relaxed, the computational cost decreases by a factor of 2. 
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Figure 12. Timing of SPP and LU algorithms: multiple right sides. 



Figure 13. Timing of SPP with different accuracies. 
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6 Conclusion 


A central goal of parallel processing is to achieve better, more accurate solutions. Because obtaining 
more accurate solutions, in general, means adding more discretization points, larger systems result 
and require greater computational power. The accuracy of a simulation solution is also bounded 
by the discretization scheme used. A clear requirement for obtaining a more accurate solution is to 
adopt discretization methods with high-order accuracy. Previously, a highly accurate discretization 
scheme, the compact finite- difference scheme , has been proposed. However, the almost symmetric 
Toeplitz tridiagonal systems that arise from compact schemes are sequential in nature and difficult 
to solve efficiently on parallel computers. In this paper, we have introduced a parallel algorithm, 
the simple parallel prefix (SPP) algorithm, for compact schemes. 

The SPP algorithm is designed for fine-grain computing. With n processors, the SPP algorithm 
solves an n-dimensional system with 2log(n) + 1 AXPY operations. Two prefix communications 
are required in the solving phase and one broadcast communication is required in the modification 
phase. In comparison with existing tridiagonal solvers [17, 8], the SPP algorithm is simple in 
computing and simple in communication. It requires storage of only one log(n)-dimensional vector 
for the computing phase and one n-dimensional vector for the modification phase. When the 
tridiagonal system is diagonally dominant, both the computing and the modification phases can 
be truncated without degrading the accuracy. Memory requirements will be further reduced when 
truncation is applied. A detailed accuracy analysis has been conducted to find the appropriate 
truncation number. Experimental results show that the SPP algorithm achieves a speedup greater 
than 1000 over the best sequential algorithm on a 16K PEs MasPar M-l SIMD parallel computer. 
In addition to the good performance on the SIMD machines, the SPP algorithm also out performs 
the best sequential algorithm on a vector machine (Cray 2), even on systems with multiple right 
sides. Experimental and theoretical results show that the SPP algorithm is a good choice for 
compact schemes and for the emerging high-performance parallel computers. 

The SPP algorithm is designed for almost symmetric Toeplitz tridiagonal systems. It can be 
modified for different boundary conditions and for cases where the number of processors p is less 
than the dimension of the system. However, generalization of the algorithm for general tridiagonal 
systems or for band systems is unlikely. 

The work presented in this paper is a continuation of efforts to design efficient parallel solvers 
foT compact scheme. An efficient solver, the PDD algorithm, for coarse- or median-grain computing 
has been proposed [18]. The PDD algorithm and the SPP algorithm can be combined on parallel 
machines with vector processing units. 
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